On Abruptly-Changing and Slowly-Varying Multiarmed Bandit Problems
Authors
Abstract
We study the non-stationary stochastic multiarmed bandit (MAB) problem and propose two generic algorithms, namely, the limited memory deterministic sequencing of exploration and exploitation (LM-DSEE) and the Sliding-Window Upper Confidence Bound# (SW-UCB#). We rigorously analyze these algorithms in abruptly-changing and slowly-varying environments and characterize their performance. We show that the expected cumulative regret for these algorithms under either of the environments is upper bounded by sublinear functions of time, i.e., the time average of the regret asymptotically converges to zero. We complement our analytic results with numerical illustrations.
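The abstract does not spell out the SW-UCB# index or the LM-DSEE epoch structure, so the sketch below is only a generic sliding-window UCB loop: statistics are computed over the most recent plays, which lets the index track a mean that changes over time. The window length, the exploration constant `c`, the Bernoulli rewards, and the abrupt mean switch at t = 1000 are all illustrative assumptions, not the paper's algorithm.

```python
import math
import random

def sliding_window_ucb(mean_fn, n_arms, horizon, window, c=1.0, seed=0):
    """Generic sliding-window UCB on Bernoulli arms (illustrative sketch).

    All statistics are computed over the most recent `window` plays, so the
    index tracks recent rewards and can adapt when the arm means change.
    """
    rng = random.Random(seed)
    history = []  # (arm, reward) pairs, trimmed to the last `window` rounds
    total_reward = 0.0

    for t in range(1, horizon + 1):
        # Empirical counts and reward sums inside the current window.
        counts = [0] * n_arms
        sums = [0.0] * n_arms
        for arm, r in history:
            counts[arm] += 1
            sums[arm] += r

        untried = [a for a in range(n_arms) if counts[a] == 0]
        if untried:
            # Play every arm at least once before trusting the indices.
            arm = untried[0]
        else:
            width = len(history)
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + c * math.sqrt(math.log(width) / counts[a]),
            )

        # Bernoulli reward with a (possibly time-varying) mean.
        reward = 1.0 if rng.random() < mean_fn(t)[arm] else 0.0
        total_reward += reward
        history.append((arm, reward))
        if len(history) > window:
            history.pop(0)

    return total_reward

# Abruptly-changing example: the better arm switches at t = 1000.
means = lambda t: [0.3, 0.7] if t <= 1000 else [0.7, 0.3]
print(sliding_window_ucb(means, n_arms=2, horizon=2000, window=200))
```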
Similar sources
Optimal Policies for a Class of Restless Multiarmed Bandit Scheduling Problems with Applications to Sensor Management
Consider the Markov decision problems (MDPs) arising in the areas of intelligence, surveillance, and reconnaissance in which one selects among different targets for observation so as to track their position and classify them from noisy data [9], [10]; medicine in which one selects among different regimens to treat a patient [1]; and computer network security in which one selects different compu...
On the Optimal Reward Function of the Continuous Time Multiarmed Bandit Problem
The optimal reward function associated with the so-called "multiarmed bandit problem" for general Markov-Feller processes is considered. It is shown that this optimal reward function has a simple expression (product form) in terms of individual stopping problems, without requiring any smoothness properties of the optimal reward function, either for the global problem or for the individual stopping probl...
Index Policies for Discounted Bandit Problems with Availability Constraints
A multiarmed bandit problem is studied when the arms are not always available. The arms are first assumed to be intermittently available with some state/action-dependent probabilities. It is proven that no index policy can attain the maximum expected total discounted reward in every instance of that problem. The Whittle index policy is derived, and its properties are studied. Then it is assumed ...
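The excerpt only states that arms are intermittently available and that a Whittle-type index policy is derived; it does not give the index itself. The following sketch is therefore just a schematic of the basic mechanism, playing the available arm with the largest index; the fixed index values and the per-round availability probabilities are placeholder assumptions, not the Whittle indices.

```python
import random

def play_available_highest_index(indices, availability_prob, horizon, seed=0):
    """Toy index policy: each round, play the available arm with the largest
    (fixed, precomputed) index; skip the round if no arm is available.

    `indices` are placeholder per-arm index values, and availability is
    modeled as an independent coin flip per arm per round.
    """
    rng = random.Random(seed)
    plays = [0] * len(indices)
    for _ in range(horizon):
        available = [a for a in range(len(indices))
                     if rng.random() < availability_prob[a]]
        if not available:
            continue  # no arm can be played this round
        arm = max(available, key=lambda a: indices[a])
        plays[arm] += 1
    return plays

# Arm 0 has the highest index but is rarely available.
print(play_available_highest_index([0.8, 0.5, 0.2], [0.3, 0.9, 0.9], horizon=1000))
```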
Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost - IEEE Transactions on Automatic Control
We consider multiarmed bandit problems with switching cost, define uniformly good allocation rules, and restrict attention to such rules. We present a lower bound on the asymptotic performance of uniformly good allocation rules and construct an allocation scheme that achieves the bound. We discover that, despite the inclusion of a switching cost, the proposed allocation scheme achieves the same a...
An approach for handling risk and uncertainty in multiarmed bandit problems
An approach is presented to deal with risk in multiarmed bandit problems. Specifically, the well-known exploration-exploitation dilemma is addressed from the point of view of maximizing a utility function which measures the decision maker's attitude towards risk and uncertain outcomes. A link with preference theory is thus established. Simulation results are provided in order to support ...
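The excerpt does not name the utility function the authors use, so the snippet below only illustrates one common way to encode risk attitude: scoring each arm by the empirical mean of an exponential (risk-averse) utility of its observed rewards, so that arms with volatile payoffs are penalized relative to their plain mean. The function name and the `risk_aversion` parameter are assumptions for illustration.

```python
import math
import random

def risk_averse_choice(observed_rewards, risk_aversion=1.0):
    """Pick the arm maximizing the empirical mean of the exponential utility
    u(r) = -exp(-risk_aversion * r), a standard risk-averse utility."""
    def score(rewards):
        return sum(-math.exp(-risk_aversion * r) for r in rewards) / len(rewards)
    return max(range(len(observed_rewards)), key=lambda a: score(observed_rewards[a]))

# Arm 0: steady rewards; arm 1: same mean but high variance.
rng = random.Random(0)
arm0 = [0.5 for _ in range(100)]
arm1 = [rng.choice([0.0, 1.0]) for _ in range(100)]
print(risk_averse_choice([arm0, arm1], risk_aversion=2.0))  # prefers the steady arm
```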
Journal: CoRR
Volume: abs/1802.08380
Pages: -
Publication year: 2018